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EXPRESSING  COMPLEX  PARALLEL  ALGORITHMS  IN  DINO 


Matthew  Rosing,  Robert  B.  Schnabel,  and  Robert  P.  Weaver 
University  of  Colorado  at  Boulder 


Abstract.  DINO  is  a  language,  consisting  of  additions  to  C,  for  expressing  parallel  numerical  programs 
on  distributed  memory  multiprocessors.  Its  goal  is  to  incorporate  the  high  level  features  of  parallel  algo¬ 
rithms,  such  as  the  mapping  of  data  and  procedures  to  processes,  into  the  language,  and  have  low  level 
operations  such  as  interprocess  communication  and  process  control  result  implicitly.  This  paper 
describes  the  use  of  DINO  to  program  a  moderately  complex,  multiple-phase  parallel  algorithm,  the 
parallel  solution  of  block  bordered  systems  of  linear  equations.  This  example  illustrates  the  suitability  of 
DINO  for  such  computations  but  also  points  to  some  potential  improvements  to  DINO  that  could  be 
made. 

1.  Introduction.  DINO  (Distributed  Numerically  Oriented  language)  is  a  language  for  writing  numerical 
programs  for  distributed  memory  multiprocessors.  It  consists  primarily  of  standard  C  augmented  by 
several  high-level  parallel  constructs.  Its  goal  is  to  make  the  programming  of  structured,  parallel  numeri¬ 
cal  algorithms  natural  and  easy,  while  avoiding  the  low  -level  details  involved  in  managing  processes  and 
interprocess  communication.  Descriptions  of  DINO  can  be  found  in  [5, 6].  Related  work  can  be  found  in 
[3]  [4]  and  [2]. 

We  believe  that  DINO  gives  the  programmer  who  is  trying  to  express  parallel  scientific  computations  a 
relatively  natural  way  to  write  such  programs  so  that  they  are  easily  written  and  understood  and  execute 
reasonably  efficiently.  But,  it  is  not  yet  clear  that  DINO  is  as  useful  for  large,  complicated  problems  as 
we  have  found  it  to  be  for  smaller  examples.  For  good  reasons,  much  of  the  initial  development  work  and 
most  of  the  explanation  of  a  new  language  is  done  with  small,  manageable  examples.  In  this  paper,  we 
describe  a  considerably  larger  example  than  we  have  in  the  past,  and  discuss  the  strengths  and  weaknesses 
of  DINO  that  this  example  illustrates. 

Section  2  of  this  paper  gives  a  brief  informal  description  of  DINO.  Section  3  presents  a  DINO  program 
for  solving  block-bordered  systems  of  linear  equations,  and  provides  commentary  on  the  features  of 
DINO  that  this  example  utilizes.  Section  4  gives  a  brief  critique  of  DINO’s  effectiveness  in  expressing 
this  algorithm. 

2.  DINO  Language  Overview.  When  programming  in  DINO,  the  programmer  first  defines  a  virtual  paral¬ 
lel  machine  that  best  fits  the  major  data  structures  and  communication  patterns  of  the  algorithm.  In 
DINO,  this  arrangement  of  processors  is  called  a  structure  of  environments.  Second,  the  programmer 
specifies  the  way  that  data  structures  will  be  distributed  among  the  virtual  processors.  This  is  called  distri¬ 
buted  data.  Last,  the  programmer  provides  algorithms  that  will  run  on  each  processor  (each  environment 
in  the  same  structure  of  environments  contains  the  same  algorithm).  In  DINO,  these  are  called  composite 
procedures.  Parallelism  results  from  simultaneous  execution  of  all  the  copies  of  an  algorithm  in  a  struc¬ 
ture  of  environments.  We  now  discuss  each  of  these  features  briefly. 

The  programmer  defined  structure  of  environments  is  the  underlying  parallel  model  in  DINO.  Typically, 
the  structure  is  a  single  or  multiple  dimensional  array,  thus  defining  a  virtual  parallel  machine  with  this 
topology.  Each  environment  in  a  given  structure  consists  of  (identical)  data  structures  and  procedures,  and 
is  similar  to  a  process.  An  environment  may  contain  multiple  procedures,  but  only  one  procedure  in  an 
environment  may  be  active  at  a  time.  As  the  example  in  Section  3  will  illustrate,  a  DINO  program  may 
contain  more  than  one  structure  of  environments. 
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The  key  to  constructing  a  parallel  DINO  program  is  the  mapping  of  data  structures  to  the  structure(s)  of 
environments.  DINO  encourages  the  programmer  to  think  of  data  structures  that  are  operated  on  con¬ 
currently  as  single  data  structures  that  are  broken  up,  distributed  to  multiple  processors  for  computation, 
and  then  reassembled.  The  programmer  does  this  by  declaring  a  data  structure  as  distributed  data  and 
providing  its  mapping  to  the  structure  of  environments.  Individual  data  elements  may  be  mapped  to  only 
one  environment,  or  they  may  be  mapped  to  multiple  environments.  These  mappings  determine  how  the 
environments  will  access  and  share  the  data.  DINO  provides  an  extensive  library  of  standard  mapping 
functions,  and  the  user  can  create  additional  mapping  functions. 


Distributed  variables  are  the  key  to  making  interprocess  communication  in  DINO  natural  and  implicit. 
One  way  this  occurs  is  by  using  distributed  variables  as  parameters  to  composite  procedures;  the  parame¬ 
ter  is  distributed  to  or  collected  from  the  appropriate  environments  according  to  its  mapping  function. 
The  second  way  is  to  use  a  distributed  variable  in  an  expression.  A  local  access  to  a  distributed  variable, 
which  uses  standard  syntax,  affects  just  the  local  copy  and  is  the  same  as  any  standard  reference  to  a  vari¬ 
able.  A  remote  access,  which  uses  a  #  following  the  variable  name,  is  used  to  generate  interprocess  com¬ 
munication.  A  remote  assignment  to  a  distributed  variable  generates  a  message  that  is  sent  to  other 
environment(s)  to  which  that  variable  has  been  distributed,  while  a  remote  read  of  a  distributed  variable 
will  receive  such  a  message  and  update  the  variable’s  value.  In  the  default,  synchronous  mode,  a  remote 
read  overwrites  the  local  copy  of  the  distributed  variable  with  the  first  value  that  has  been  received  since 
the  last  remote  read;  if  no  new  value  is  present,  it  blocks  until  one  is  received.  An  asynchronous,  non- 
blocking  variant  also  is  provided  but  is  not  discussed  in  this  paper. 

A  composite  procedure  is  a  set  of  identical  procedures,  one  residing  within  each  environment  of  a  struc¬ 
ture  of  environments,  that  are  executed  concurrently.  The  parameters  of  a  composite  procedure  typically 
include  distributed  variables.  Composite  procedures  are  called  from  the  main,  host  environment  that  is 
part  of  every  DINO  program.  A  composite  procedure  call  causes  an  instance  of  the  procedure  to  execute 
in  each  environment  of  the  structure,  utilizing  the  portion  of  the  distributed  parameters  that  are  mapped  to 
its  environment.  This  results  in  a  single  program,  multiple  data  form  of  parallelism.  There  is  some  sup¬ 
port  for  functional  parallelism  in  DINO,  but  this  is  not  discussed  in  this  paper. 


3.  Example.  The  example  program  we  describe  computes  the  solution  to  a  set  of  linear  equations 
(Ax -  f)  where  the  system  has  a  block  bordered  structure  of  the  form 
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Here  A;  €  Rnx\  Bt  e  Rnxm ,  Q  e  Rmxn,  P  e  Rmxm ,  XiJi  e  Rn  (1  <i<q),  xq+lfq+l  e  Rm .  Block  bordered 
systems  arise  very  commonly  in  parallel  and  sequential  algorithms  for  solving  problems  in  circuit  design, 
structural  analysis,  and  other  areas  (see  e.g.  [1]). 


The  solution  to  (3.1)  is  computed  from  the  equations 
AiXi  +  BlXq+i  =  fi  1  <i  <q 
and 

ItjCiXi  +  Pxq+i  —fq+ 1 

Solving  the  first  equation  for*,-  and  substituting  this  into  the  second  gives 
(P  -IjCMi-'Bilx, -  tCiAr'fi 
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from  which  xq+\  can  be  derived,  and  put  back  into  the  first  equation  to  find  the  xt  ’s. 
The  algorithm  for  finding  the  solution  consists  of  five  steps: 

1)  factor  each  A,- , 
compute  z i  =  Arlfi  ,  and 
compute  Wi  =  ArlBi  1  <z  <q 

2)  form  J  =P  -ffiWi  and 

compute  b  =fq+l~t[Cizi 

3)  decompose  J  into  LU  form 

4)  solve  Jxq+i  =  b 

5)  compute  xi  =  -Wixq+i-zi ,  l<i  <q 


The  parallelism  in  the  first  and  last  step  is  fairly  straight  forward  and  suggests  a  virtual  machine  consist¬ 
ing  of  q  processors.  In  the  second  step  the  formation  of  each  QWi  can  be  done  in  parallel  and  suggests 
the  same  virtual  machine.  The  solution  of  the  smaller  system  in  steps  3  and  4  can  either  be  performed  on 
one  machine  or  distributed  across  m  processors,  depending  on  its  size.  In  our  example  we  will  distribute 
the  solution  of  the  smaller  system,  using  a  second  virtual  machine  with  m  processors. 

A  DINO  program  for  the  above  algorithm  is  given  in  Section  5.  In  the  remainder  of  this  section  we  dis¬ 
cuss  this  program,  and  use  it  to  illustrate  interesting  aspects  of  the  syntax  and  semantics  of  DINO.  Our 
discussion  proceeds  from  the  highest  level  of  abstraction,  the  virtual  machines,  to  the  next  level,  the 
definition  of  the  composite  procedures  and  distributed  data  structures  that  are  mapped  onto  these  virtual 
machines,  and  finally  to  the  lowest  level,  the  statements  which  use  these  procedures  and  data  structures. 

The  virtual  machines  for  this  algorithm  are  defined  by  the  environment  constructs  on  lines  5,  61  and  129. 
Line  5  defines  a  virtual  machine,  or  environment  structure,  consisting  of  M  processors  named  solve  [0]  to 
solve  [M-l]  each  of  which  contains  the  data  and  procedures  between  lines  6  and  59.  These  environments 
are  used  to  implement  steps  3  and  4  of  the  algorithm.  A  second  set  of  environments,  the  node  environ¬ 
ments,  are  defined  in  a  similar  fashion  arid  are  used  to  implement  the  parallel  aspects  of  steps  1,  2,  and  5 
of  the  algorithm.  Line  129  defines  the  host  environment  which  is  a  single,  controlling  environment  where 
execution  of  the  program  begins.  On  a  distributed  memory  multiprocessor,  both  the  solve  environments 
and  the  node  environments  would  be  partitioned  evenly  among  the  processors  (e.g.  if  M  =  Q  =  the  actual 
number  of  processors,  then  each  processor  would  contain  one  copy  of  each  environment)  and  the  host 
environment  would  reside  on  the  host  processor  (assuming  such  an  arrangement,  as  exists  on  current 
hypercubes). 

The  host  environment  includes  the  procedure  main,  which  is  the  first  procedure  to  be  called.  The  first  and 
last  statements  in  main  initialize  the  data  and  print  the  results.  They  are  executed  (sequentially)  on  the 
host  processor  since  that  is  where  these  procedures  are  defined.  The  next  five  statements  in  main  are  com¬ 
posite  procedure  calls  (denoted  by  the  ’#’  at  the  end  of  each),  and  implement  the  five  parallel  steps  of  the 
algorithm  discussed  above.  For  each  of  these  calls,  a  copy  of  the  composite  procedure  is  executed  in 
parallel  on  each  of  the  environments  where  it  is  defined.  For  the  composite  procedures  distju  and 
distjolve ,  M  copies  are  executed  in  parallel  on  the  M  solve  environments,  and  for  the  others  Q  copies 
execute  in  parallel  on  the  Q  node  environments. 

As  composite  procedures  are  to  be  executed  on  remote  nodes  of  the  actual  parallel  machine,  it  is  not  pos¬ 
sible  to  pass  arrays  as  parameters  in  the  normal  C  fashion  (a  pointer  to  the  first  element).  Instead,  DINO 
supports  passing  entire  arrays  between  functions  both  as  parameters  and,  in  the  case  of  local  procedures, 
as  results.  The  syntax  for  describing  the  entire  length  of  an  array  along  one  axis  is  the  ’[]’  operator.  The 
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description  of  the  entire  array  is  therefore  a  ’[]’  operator  for  each  axis.  Slices  and  subsections  are  also 
possible  and  will  be  described  later  on. 

The  definitions  of  the  composite  procedures  start  on  lines  6,  38,  88,  101  and  118.  A  composite  procedure 
is  similar  to  a  normal  C  procedure.  An  important  difference,  consistent  with  the  style  of  passing  arrays 
discussed  above,  is  that  composite  procedures  can  have  value,  result,  and  value-result  parameters, 
whereas  parameters  to  normal  C  functions  are  always  passed  by  value.  For  example,  in  the  procedure 
formjb  the  parameters  C  and  fqp  are  passed  by  value,  b  is  passed  by  result  and  P  is  passed  by  value- 
result.  These  modifications  seem  necessary  to  support  parameter  passing  in  a  distributed  memory  environ¬ 
ment. 

In  order  for  composite  procedures  to  operate  on  different  sections  of  data,  composite  procedure  parame¬ 
ters  and  other  data  structures  must  be  distributed  over  the  environments  on  the  virtual  machines.  This  dis¬ 
tribution  is  specified  by  the  mapping  function  in  distributed  data  declarations.  For  example,  in  line  89,  >1 , 
the  set  q  of  diagonal  blocks  A\  •  •  •  Aq ,  is  defined  as  a  three  dimensional  array  which  is  distributed  based 
on  the  programmer  defined  slice  (line  63)  mapping  function.  This  mapping  function  partitions  an  array 
along  its  first  axis  only,  thus  placing  the  two  dimensional  array  A  on  environment  node  [/].  The  mapping 
is  readily  constructed  by  using  the  DINO-supplied  block  and  compress  mappings  along  the  first,  second 
and  third  axes  respectively.  Similarly,  the  mapping  of  /  on  line  90  places  N  consecutive  elements  of  /  on 
each  environment,  and  the  mapping  function  byRow ,  defined  in  line  1,  distributes  a  two  dimensional  array 
row-wise  over  a  vector  of  environments.  A  mapping  function  to  distribute  a  matrix  column-wise  could  be 
defined  using  map  byCol  =  [compress ][block]. 

The  place  in  the  program  where  data  is  declared  depends  on  where  it  is  to  be  used,  and  follows  standard 
scoping  conventions  of  block  structured  languages.  For  example  the  intermediate  value  z  is  declared 
within  the  node  environment  (line  66)  and  its  scope  is  this  structure  of  environments.  Since  it  is  never 
needed  by  the  host  environment  it  is  neither  stored  there  nor  passed  back  and  forth  as  a  parameter.  Its 
contents  are  valid  for  the  duration  of  the  program.  An  interesting  case  is  the  last  part  of  the  x  vector,  xqp . 
It  too  is  an  intermediate  value  but  must  be  remapped  between  the  fourth  and  fifth  steps  of  the  algorithm. 
In  the  fourth  phase,  where  it  is  calculated,  it  is  partitioned  among  the  solve  environments  (renamed  x).  In 
the  fifth  phase,  where  it  is  required  to  compute  eachxt,  it  is  replicated  among  the  node  environments.  This 
is  indicated  by  the  parameter  definitions  of  x  in  dist_solve  and  xqp  in  compute jd ,  and  the  use  of  xqp  as  an 
out  parameter  in  distjolve  and  an  in  parameter  in  compute  xi. 

Finally  we  comment  on  some  of  the  executable  statements  within  the  procedures.  The  composite  pro¬ 
cedures  f  actor _A  and  compute  jd  have  no  interprocess  communication,  are  similar  to  normal  C  functions, 
and  therefore  will  not  be  described  here.  In  the  procedure  formjb ,  however,  interprocess  communica¬ 
tions  is  needed  for  the  computation  of  the  sum  of  QWit  and  the  right  hand  side  of  the  same  set  of  equa¬ 
tions.  The  calculation  of  the  first  sum  occurs  in  line  1 10.  The  global  summation  of  a  set  of  values  is  a  typ¬ 
ical  reduction  operation  in  which  each  process  has  a  value,  in  this  case  C,-  Wi  returned  from  matjnatjnult , 
and  the  sum  of  these  values  is  required.  This  is  provided  by  the  function  gsum  in  line  1 10.  This  is  one  of 
several  reduction  functions  provided  in  the  DINO  language  which  takes  a  value  from  each  of  a  set  of 
environments  and  applies  an  associative  operator  on  them.  The  #  indicates  that  a  remote  operation  is 
being  done.  The  result  of  the  gsum  call,  an  M  by  M  array,  is  assigned  to  the  temporary  matrix  T  and  the 
formation  of  J ,  stored  in  P ,  is  than  completed  by  subtracting  T  from  P .  The  matrix  P  has  more  than  one 
row  mapped  to  each  processor  and  therefore  the  subtraction  is  done  for  each  of  the  rows  mapped  onto  the 
current  environment. 


A  block  mapping  distributes  n  elements  on p  environments  by  placing  nip  elements  on  each  environment. 


The  summation  of  C,-z,-  in  line  113  is  done  in  a  similar  fashion.  Note  that  z,-  corresponds  to  a  range  of 
values  out  of  the  z  array.  In  DINO  the  syntax  for  specifying  a  subsection  is  ’[</,;>]’  indicating  the 
values  from  i  to  j  inclusive. 

The  solution  for^+i  is  done  in  the  composite  procedures  distju  and  dist_solve .  These  take  a  matrix  which 
is  distributed  by  row  and  find  the  solution  of  the  system  of  equations  using  Gaussian  elimination  with  par¬ 
tial  pivoting.  Within  distju  the  pivot  row  is  computed  in  line  16  using  the  reduction  function  gmaxdex 
which  returns  the  index  (second  parameter)  of  the  maximum  value.  This  operation  is  only  executed  over 
values  from  node[diag]  to  the  last  node.  This  is  indicated  by  specifying,  within  curly  brackets,  the 
environments  which  will  participate.2  The  next  statement  swaps  the  rows  if  a  pivot  is  required.  The 
environment  containing  the  pivot  row  first  sends  the  row  to  the  environment  containing  the  diagonal  row. 
This  is  done  by  assigning  row  A  [pivot][]  to  row  A  [diag][].  The  #  indicates  that  a  remote  access  is  taking 
place  and  therefore  a  message  is  sent  to  solve  [ diag ],  the  environment  where  A  [diag]\]  is  mapped,  contain¬ 
ing  die  value  of  A[pivot][].  This  message  is  tagged  such  that  a  corresponding  remote  read  of  A  [diag][]  will 
receive  this  value  and  assign  it  to  A  [diag][].  The  send  is  done  in  line  19  and  the  corresponding  receive  is 
done  in  line  24.  The  pivot  environment  also  sends  the  pivot  row  to  the  rest  of  the  environments  partici¬ 
pating  in  the  operation  as  the  value  prow  which  is  than  used  for  the  reduction  step.  If  no  pivot  is  required 
than  just  the  pivot  row  is  sent  to  the  remaining  participants.  The  next  statement  receives  the  reduction  row 
from  the  environment  which  contains  it.  Note  that  a  remote  read  or  write  to  the  current  environment  is 
equivalent  to  a  local  read  or  write. 

4.  Discussion.  The  example  program  illustrates  that  a  non-trivial  parallel  algorithm  can  be  rather  easily 
implemented  in  DINO.  The  high  level  constructs  in  DENO,  namely  the  virtual  parallel  machines,  distri¬ 
buted  data  mappings,  and  composite  procedures,  allow  the  structure  of  the  parallel  program  to  correspond 
quite  naturally  to  the  underlying  high  level  view  of  the  parallel  algorithm.  For  example,  the  ability  to 
specify  multiple  virtual  machines  was  quite  helpful  in  making  this  example  program  easy  and  natural  to 
express.  These  high  level  constructs  also  enable  the  specification  of  process  control  and  communication 
for  this  program  to  be  significantly  simpler  and  more  natural  than  it  would  be  in  many  other  currently 
available  systems.  For  example,  communications  is  implicitly  generated  for  composite  procedure  calls, 
the  remote  access  of  distributed  data,  and  reduction  operations.  These  constructs  also  allow  the  compiler 
to  ensure  that  the  communications  is  correct  with  respect  to  message  type  and  content.  The  result  is  that 
we  find  the  DENO  program  relatively  easy  both  to  write  and  to  understand. 

This  example  also  helps  indicate  several  aspects  of  the  language  that  could  be  improved  upon.  The  first 
item  has  to  do  with  the  level  of  parallelism  or  granularity.  Often,  it  is  most  natural  to  express  a  parallel 
algorithm  at  a  fine  grain  level  of  parallelism.  For  example,  distju  and  dist_solve  are  written  assuming 
there  is  one  row  per  virtual  processor;  these  procedures  are  considerably  more  cumbersome  to  write  if 
each  process  must  handle  a  block  of  rows.  However,  the  current  implementation  of  DINO  maps  each  vir¬ 
tual  process  (environment)  to  an  actual  process.  If  there  are  many  more  rows  than  processors  then  there 
will  be  many  processes  on  each  processor,  resulting  in  an  inefficient  program.  The  preferred  solution  to 
this  problem  is  to  have  the  programmer  continue  to  write  at  the  natural,  fine  grain  level,  and  to  have  the 
compiler  contract  multiple  environments  into  single  processes  when  there  are  more  processes  than  pro¬ 
cessors.  We  believe  the  structure  of  DINO  will  allow  us  to  do  this  efficiently  in  many  cases. 

The  other  major  area  we  would  like  to  address  is  improved  support  for  complex,  multi-phase  computa¬ 
tions.  The  example  points  out  several  ways  in  the  current  DINO  facilities  for  this  could  be  improved. 


2  Any  remote  operation  has  an  implicit  set  of  environments  on  which  an  operation  takes  place.  This  set  can  be  explicitly  changed  by  speci¬ 
fying,  as  in  this  example,  the  participating  environments.  In  the  case  of  a  remote  read  or  write  the  implicit  set  of  environments  is  the  environ¬ 
ments  on  which  the  data  is  mapped.  In  the  case  of  a  reduction  operator  it  is  the  set  of  environments  within  which  the  operation  takes  place.  In  the 
case  of  a  composite  procedure  call  it  is  the  set  of  environments  in  which  the  procedure  is  defined. 
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One  is  that  the  program  requires  the  solution  of  several  systems  of  linear  equations,  to  find  z,- ,  W ,  and  xqp . 
In  a  serial  program  only  one  function  would  be  written  to  handle  this  but  in  DINO  we  are  required  to 
write  both  a  serial  and  parallel  version.  This  problem  is  related  to  the  contraction  problem  mentioned  ear¬ 
lier. 

A  more  pervasive  need  is  to  be  able  to  nest  composite  procedure  calls  in  a  useful  way.  In  the  example 
main  consists  of  several  consecutive  calls  to  composite  procedures.  The  code  would  be  more  readable  and 
more  useful  for  later  use  if  main  consisted  of  a  single  composite  procedure,  say  block _border ,  which  could 
call  f  actor _A ,  formjb,  and  compute _xi .  Currently  this  can  not  be  done  because  the  fork  and  join  aspects 
of  a  composite  procedure  call  are  tied  to  its  being  called  from  and  returning  to  the  host  environment.  A 
solution  to  this  would  be  to  allow  the  processes  in  a  composite  procedure  to  collectively  come  together 
and  substitute  themselves  with  another  composite  procedure.  This  facility  would  allow  DINO  programs 
to  become  subprograms  in  larger  DINO  programs  with  little  or  no  modification. 

A  related  but  more  subtle  problem  arises  when  various  phases  of  the  algorithm  are  most  naturally 
expressed  at  different  levels  of  parallelism.  For  instance,  most  of  the  example  algorithm  has  Q  fold  paral¬ 
lelism,  which  is  reflected  by  having  Q  node  environments.  However,  for  the  solution  of  xqp ,  the  algo¬ 
rithm  has  M  fold  parallelism  which  is  reflected  by  using  M  solve  environments.  DINO  currently  handles 
this  problem  by  moving  J  (stored  in  P)  from  one  set  of  environments  to  another  between  phases.  This  has 
two  disadvantages.  First,  it  is  inefficient  because  of  the  extra  communication  that  is  required  and  the  over¬ 
head  of  extra  environments.  Second,  the  solution  to  the  smaller  system  of  equations,  implemented  by  cal¬ 
ling  distju  and  dist_solve ,  is  logically  a  sub  part  of  the  function  formjb  and  therefore  should  not  be 
invoked  within  main  but  should  be  invoked  at  the  end  of  form Jb .  The  substitution  mechanism  proposed 
in  the  previous  paragraph  will  also  solve  this  problem  as  long  as  we  allow  the  substitute  composite  pro¬ 
cedure  to  use  a  different  structure  of  environments. 

A  final  problem  concerning  the  different  phases  of  an  algorithm  has  to  do  with  the  remapping  of  data. 
Note  that  xqp  is  distributed  during  distjolve  and  is  replicated  for  compute _xi .  In  order  to  accomplish  this 
remapping  it  must  be  passed  up  to  main  and  back  down  to  computejd.  The  ability  to  remap  data  without 
passing  it  back  to  the  host  would  be  more  useful. 

The  problems  mentioned  above  are  either  instances  where  the  ease  of  expression  in  DINO  leads  to  sub- 
optimal  efficiency,  or  where  its  support  for  large  scale  program  development  seems  sub-optimal. 
Although  we  feel  that  it  is  still  easier  to  write  an  efficient  distributed  parallel  program  using  DINO  than 
using  many  other  currently  available  systems,  we  would  like  to  improve  upon  these  problems.  One  of  our 
goals  is  to  determine  how  well  a  compiler  might  be  able  to  convert  an  inefficient  but  easy  to  understand 
DINO  program  into  an  efficient  one.  A  second  is  to  explore  the  inclusion  of  a  small  number  of  additional 
language  features  in  DINO. 

5.  Sample  program 

1  map  byRow  =  [block] [compress]; 

2  map  by  Block  =  [block]; 

3  map  byElement  =  [block]; 

4 

5  environment  solve[M:id]  { 

6  composite  dist _lu(A) 

7  double  distributed  A[M][M]  map  byRow; 

8  { 

9  int  diag,  pivot,  i; 

10  double  pval,  mult; 

1 1  double  distributed  prow[M]  map  all; 

12 
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13 

14 

15 

16 

17 

18 

19 

20 
21 
22 

23 

24 

25 

26 

27 

28 

29 

30 

31 

32 

33 

34 

35 

36 

37 

38 

39 

40 

41 

42 

43 

44 

45 

46 

47 

48 

49 

50 

51 

52 

53 

54 

55 

56 

57 

58 

59 

60 
61 
62 

63 

64 

65 

66 

67 

68 

69 

70 


for  (diag=0;  diag<M;  diag++) 
if  (diag  <=  id){ 

/*find  pivot*/ 

pivot  =  gmaxdex(dabs(A[id]  [diag]),  id)#{  solve[<diag,M-l>]}; 
if  (pivot  !=  diag)  {  /*pivot*/ 

if  (id  ==  pivot)  {  /*send  pivot  row  to  everyone  who  needs*  it/ 
A[diag][]#  =  prow[]#[solve[<diag+l,M-l>]}  =  A[pivot][]; 
A[pivot][]  =  A[pivot][]#[solve[diag]j; 

} 

if  (id  ==  diag){  /*pivot*/ 

Afpivot]  []#  =  A[diag][]; 

A[diag][]  =  A[diag]  []#  [solve[pivot] } ; 

} 

} 

else  if  (id— diag) 

prow[]#(solve[<diag+l,M-l>]}  =  Afpivot] [];  /*  if  not  pivot*/ 
prow[]  =  prow[]#  [solvefpivot] } ;  /*get  pivot  row*/ 
if  (diag  <  id)  {  /*eliminate*/ 

mult  =  Afid] [diag]  /=  -prowfdiag];  /*compute  1*/ 
for  (i=diag+l;  i<M;  i++) 

A[id][i]  +=  mult*prow[i]; 

} 


composite  dist_solve(in  lu,  out  x,  in  b) 

double  distributed  lu[M][M]  map  byRow; 
double  distributed  x[M]  map  byElement; 

double  distributed  b[M]  map  byElement; 

{ 

double  distributed  y[M]  map  byElement; 

int  i; 

/*  solve  ly  =  b  */ 
y  [id]  =  b  [id] ; 
for  (i=0;  kid;  i++) 

y[id]  -=  y[i]#  *  lu[id][i]; 
y  [id]#  { node[<id+ 1  ,M>] }  =  y[id]/lu[id]  [id]; 

/*solve  ux=y*/ 
x[id]  =  y[id]; 
for  (i=M;  i>id;  i--) 

x[id]  -=x[i]#*lu[id][i]; 
x[id]#(node[<0,id-l>]]  =  x[id]/lu[id][id]; 

}; 


} 

environment  node[Q: id]  { 

map  slice  =  [block] [compress] [compress]; 

double  distributed  W[Q][N][M]  map  slice;  /*  =  A‘-l  B*/ 
double  distributed  z[Q*N]  map  byBlock; 

mat_vec_mult(W,v,x)[N]  /*x  <=  Wv*/ 
double  W[N][M],  v[M]; 

{}; 
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71 

72  sub_vec(v,x,l)  /*v  <=  v  -  x,  1  is  length  of  vectors*/ 

73  double  vQ,  x[]; 

74  {}; 

75 

76  double  mat_mat_mult(A,B)[M][M]  /*retums  AB*/ 

77  double  A[M][N],  B[N][M]; 

78  {}; 

79 

80  seq_lu(A)  /*does  lu  decomposition  on  A*/ 

81  double  A[N][N]; 

82  {}; 

83 

84  seq_solve(lu,  x,  b)  /*solves  lu  x  =  b*/ 

85  double  lu[N][N],x[N],b[N]; 

86  {}; 

87 

88  composite  factor_A(in  A,  in  f,  in  B) 

89  double  distributed  A[Q]  [N]  [N]  map  slice; 

90  double  distributed  f[Q*N]  map  byBlock; 

91  double  distributed  B[Q][N][M]  map  slice; 

92  { 

93  int  i; 

94 

95  seq_lu(A[id]);  /* decompose  each  block*/ 

96  seq_solve(A[id],  &z[id*N],  &f[id*N]);  /*z  =  A  inv  f*/ 

97  for  (i=0;  i<M;  i++)  /*W  =  A  inv  B*/ 

98  seq_solve(A[id],  W[id][i],  B[id][i]); 

99  }; 

100 

101  composite  form_Jb(P,  in  C,  in  fqp,  out  b) 

102  double  distributed  P[M][M]  map  byRow; 

103  double  distributed  C[Q][M][N]  map  slice; 

104  double  distributed  fqp[M]  map  byBlock; 

105  double  distributed  b[M]  map  byBlock; 

106  { 

107  double  tz[N],T[M][M]; 

108  int  i; 

109 

110  T[][]  =  gsum(mat_mat_mult(C[id],  W[id]))#;  /*C[id]  <=  Ctid]*^-1  [id]*B[id]*/ 

111  for  (i=id*M/Q;  i<  id*M/Q  +  M/Q;  i++) 

1 12  sub_vec(P[i] ,  T[i]  [] ,  M);  /*compute  J  (in  P)*/ 

1 13  tz[]  =  gsum(mat_vec_mult(C[i],  z[<id*N,  id*N+N-l >]))#;  /*  sum  z*/ 

1 14  for  (i=id*M/Q;  i<  id*M/Q  +  M/Q;  i++) 

1 15  b[i]  =  fqp[i]  -  tz[i];  /*b  =  fqp  -  sum  z */ 

116  }; 

117 

1 18  composite  compute_xi(in  xqp,  out  x) 

1 19  double  distributed  xqp[M]  map  all; 

120  double  distributed  x[Q*N]  map  byBlock; 

121  { 

122 

123  mat_vec_mult(W[id],  xqp,  &x[id*N]); 

124  neg_vec(&x[id*N]); 

125  sub_vec(&x[id*N],  &z[id*N],  N); 

126  }; 

127  } 

128 
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129  environment  host{ 

130  init_data() 

131  {}; 

132 

133  print_results() 

134  {}; 

135 

136  double  A[Q][N][N],  B[Q][N][M],  C[Q][M][N],  P[M][M], 

137  fqp[M],  f[Q*N],  xqp[M],  x[Q*N],  b[M]; 

138  main(){ 

139  init_dataO;  /initialize  A  B  C  P  and  f*/ 

140  factor_A(  A[][][],  fQ,  B  [][][])#; 

141  form_Jb(  P[][],  C[][]Q,  fqp[],  b[])#; 

142  dist_lu(  P[]  [])#; 

143  dist_solve(  P[][],  xqpQ,  b[])#; 

144  compute_xi(xqp[],  x[])#; 

145  print_results(); 

146  } 

147  } 
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